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Abstract 

Motivation:It is well known that the development of cancer is caused 
by the accumulation of somatic mutations in oncogenes and tumor sup- 
pressors within the genome. For oncogenes specifically, current research 
suggests that there is a small set of "driver" mutations that are primarily 
responsible for tumorigenesis. Further, due to some recent pharmaco- 
logical successes in treating these driver mutations and their resulting 
tumors, a variety of methods have been developed to identify potential 
driver mutations using methods such as machine learning and mutational 
clustering. We propose a novel methodology that increases our power to 
identify mutational clusters by taking into account protein tertiary struc- 
ture via a graph theoretical approach. 

Results: We have designed and implemented a novel algorithm, Graph- 
PAC (Graph Protein Amino acid Clustering), to identify mutational 
clustering while considering protein spatial structure. Using GraphPAC , 
we are able to detect novel clusters in proteins that are known to ex- 
hibit mutation clustering as well as identify clusters in proteins without 
evidence of prior clustering based on current methods. Specifically, by uti- 
lizing the spatial information available in the Protein Data Bank (PDB) 
along with the mutational data in the Catalogue of Somatic Mutations in 
Cancer (COSMIC), GraphPAC identifies new mutational clusters in well 
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known oncogenes such as EGFR and KRAS. Further, by utilizing graph 
theory to account for the tertiary structure, GraphPAC identifies clusters 
in DPP4, NRP1 and other proteins not identified by existing methods. 

Availability: R package available on Bioconductor at: 

http: //www.bioconductor . org/packages/2 . 12/bioc/html/GraphPAC .html 

Contacts: gregory.ryslik@yale.edu; hongyu.zhao@yale.edu 



1 Introduction 



Cancer, while one of the most widespread and heterogeneous diseases, is at its 
most fundamental a disease brough t on by the accumulation of somatic muta- 
tions (jVogelstein and Kinzleii |2004) . These mutations typically occur in either 
tumor suppressors or in oncogenes. While oncogenic mutations either tend 
to deregulate or up-regulate the resulting protein behavior, mutations within 
tumor suppressors typically lower the activity of genes that prevent cancer. 
Moreover, pharmacological intervention has shown to be more effective with 
inhibiting activating oncogenes than with restoring functionality of tumor sup- 
pressing genes. Combined with the theory of "oncogene addiction", that many 
cancers are dependent upon a small set of key genes to drive their rapid cellular 
multiplication with the r est of the mutation s simp ly being passenger mutations 



( Greenman et all , l2007t IWeinstein and Joel 120061 ). the identification of driver 



oncogenic mutations has become of critic a l importance in cancer rese arch. 

Re cent studies bv lBardelli et al. ( 2003 ); Greenman et al. ( 2007 ) and Torkamani and Schork 
(2008) show that activatin g soma tic mutations tend to cluster within protein 
kinases. Similarly. IWagner ( 2007 ) postulates that mutational clustering which 
leads to protective or harmful phenotypic changes can be indicative of positive 
selection and regions that are functionally significant. In turn, these functionally 
significant regions repre sent potential si tes fo r protein eng i neerin g. To further 
support the hypothesis, Ye et al. (2010) and Rvslik et al. (120131) showed that 
mutational clusters can be indicative of activating mutations and that finding 
such clusters is a way to reduce the driver mutation search space needing to be 
analyzed. 

Due to the importance of this problem, several approaches have been pro- 
posed to detect naturally selected regions as well as regions in which potential 
activating mutations occur. One such approach postulates that driver mutations 
will h ave a larger non-synonymous mutation ra te than the background mutation 
rate ( Siblom et al. . 2006j Bardelli et al. . 2003 ). Similarly, based on the hypoth- 
esis that the neutral rate of nucleotide substitution is surpassed when positive 
selection is acting on a certain region, one can check if the ratio of nonsynony- 
mous (K a ) to synonymo us (K s ) muta tions per site is greater than 1 (jKreitmanl . 
2000l ). Correspondingly. IWansJ ( 2002 ) postulates that driver mutations can be 
identified by a higher mutational rate as compared to the background rate after 
normalizing for the length of the gene. 

Researchers have also focused on creating classifiers in order to categorize 
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muta tions. Machine learning algorithms such as Polyphen-2 ([Adzhubei et al. 
20101). which classifi es whether a missense mutation is damaging, and CHASM 



(|Carter et a71 . l2009h . which discriminates between known driver mutations and 



a set of passenger mutations, rely upon a set of rules that are developed from 
a training set. The rules, devel oped using a va riety of machine learning tech- 
niques such as Random Fo rests (jBreimanl . 120011 ) and Support Vector Machines 
(jCortes and Vapnikl . 1 1995T) , are then used to calculate a score for each mutation. 
These scores are calculated using both sequence and non-sequence features such 
as evolutionary conservatio n, size and polarity of the substituted residue as well 



as accessible surface area (Reva et al. 



hand, such as SIFT, (|Ng and Henikofi 



20111) . Other classifiers on the other 



20011 ). use only evolutionary conserva- 



tion in order to predict functional change and whether the change is damaging 
or tolerated. 

While the methods based upon mutational background rates have had some 
success in identifying regions of positive selections or driver mutations, they 
nonetheless suffer from several shortcomings. Many of these methods rely upon 
calculating the difference between synonymous and non-synonymous mutations 
but do not take into account that selection can act upon minute regions of the 
gene. Thus, when the mutations rates are averaged over t he entire gene , the 
signal may be lo st. Furthermore, the methods proposed by IWand (120021) and 
Kreitman (2000) do not differentiate between activating and non-activating non- 
synonymous mutations. Alternatively, many of the machine classifier methods 
such as Polyphen-2 and CHASM require an extensive rule set that must first 
be trained using a well-detailed and annotated database of information. Until 
the requisite literature and information is developed, the machine learning algo- 
rithm is unable to create a well-performing classifier. Furthermore, the rule-set 
must be updated periodically to account for newly discovered information and 
modifications. 

Building on the work of iTorkamani and Schorkl ( 2008t ) and Bardelli et al 



( 20031) , which st ipulated that on ly a small number of specific mutations can ac- 
tivate a protein, Ye et all ( 2010l ) developed Non- Random Mutational Clustering 
(NMC) in order to identify potential activating mutations. NMC works on the 
hypothesis that absent any previously known mutational hotspot, a mutational 
cluster is thus indicative of a possible activating mutation. Assuming the null hy- 
pothesis that mutation locations follow a uniform distribution when the protein 
is considered in linear form, NMC identifies clustering by calculating whether 
any two pair-wise mutations are closer together on the line than expected by 
random chance. While this was shown to be fairly successful, NMC is limited by 
the fact that it considers the protein as a linear sequence an d does not take into 
account the tertiary protein structure. To account for this, Rvslik et all ( 2013| ) 
developed iPAC (identification of Protein Amino acid Clustering). By utiliz- 
ing Multidimensional Scaling (MDS), IP AC reorganizes the protein into a one 
dimensional space while preserving, as best as possible, the three dimensional 
amino acid pairwise distances. While a significant improvement over NMC, the 
reliance upon MDS can potentially result in a distorted rearrangement of the 
protein, since even distant residues can have an impact on each other's final 
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position in one dimensional space. 

In this manuscript, we provide an alternative method to iPAC by remapping 
the protein into one dimensional space via a graph theoretic approach. This 
approach allows for a more natural consideration of the protein that is sensitive 
to protein domains and linkers. We go on to show that our methodology is 
effective in identifying proteins with mutational clustering that are missed by 
both iPAC and NMC such as NRP1 and MAPK24. We also show that for 
some proteins, GraphPAC identifies less clusters than inferred by both iPAC 
and NMC while for other proteins GraphPAC identifies more clustering than 
inferred by the other two methods. While both GraphPAC and iPAC are an 
improvement over NMC since they account for tertiary structure, the differences 
between GraphPAC and iPAC point to the fact that different rearrangements 
of the protein must be considered in order to better understand the mutational 
clustering landscape. Finally, we proceed to show that many of the clusters 
identified by GraphPAC are also classified as damaging by Polyphen-2 and as 
an activating mutation by CHASM. In the end, by providing a more complete 
picture of mutational clustering than can be provided by iPAC or NMC alone, 
we are able to obtain a more accurate landscape of where potential activating 
mutations may occur on the protein. 



2 Methods 



GraphPAC uses a four step approach to identify mutational clusters. The first 
step, as described in Sections 12.1 1 and 12.21 obtains mut ational and po s itiona l 
data from COSMIC (|Forbes et a l, 2008) and the PDB (jBerman et all l2000h . 



respectively. After reconciling the mutational and positional databases (Section 
2.3|) . the residues are realized as a connected graph and the traveling salesman 
problem is heuristically solved in order to find the shortest path through the 
protein (Section |2.4[) . Once the shortest path has been identified, the protein 
residues are reordered along this path providing a one dimensional ordering 
of the protein. The linear NMC algorithm is then used to calculate which 
mutations are closer together than expected by chance. Lastly, the clusters are 
unmapped back into the original space and the results reported back to the 
user. We detail each of the steps in the sections below. 



2.1 Obtaining Mutational Data 

The mutational positions were obtained from the 58th version of the COSMIC 
database that was downloaded via the following ftp site: ftp . Sanger . ac . uk/pub/CGP/ cosmic 
The database was implemented locally using Oracle llg. Only missense muta- 
tions that were classified as "Confirmed somatic variant" or "Reported in an- 
other cancer sample as somatic" were selected, with nonsense and synonymous 
mutations excluded. Moreover, all the mutations originated from studies that 
were classified as whole gene screens in order to justify the assumption that 
amino acids follow a uniform distribution of mutation. Next, since multiple 
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studies can report mutational data from the same cell line, mutational redun- 
dancies were removed to avoid double countin g the mutations. L astly, only the 
proteins with a UniProt Accession Number ( Consortium . 201 lh were kept in 
order to correctly match the mutational and positional data. See "COSMIC 
query" in the supplementary information for the SQL code required to generate 
the mutational data. 



2.2 Obtaining the 3D Structural Data 

The PDB web interface was used to obtain the protein tertiary information. 
Since multiple structures are often available for the same protein, all structures 
with a matching UniProt Accession Number were used and an appropriate mul- 
tiple comparisons adjustment (see Section I2.6|) was performed afterwards. For 
proteins where the resolution provided alternative conformations, the first con- 
formation listed in the file was used. Similarly, for structures where more than 
one polypeptide chain with a matching Uniprot Accession Number was avail- 
able, the first matching chain listed in the file was used (typically chain A). 
Finally, after the side-chain and conformation are selected, the cartesian coor- 
dinates of all the a-carbon atoms are used to represent the tertiary backbone 
structure of the protein. See "Structure Files" in the supplementary materials 
for a full listing of all the 1,904 structure/side chain combinations used. 



2.3 Reconciling the Structural and Mutational Data 

In order to reference the same residue in the COSMIC and PDB databases, an 
alignment was performed to accommodate their different numbering systems. 
Like iPAC, GraphPAC allows two such reconcil i ations . The first is based upon a 



pairwise alignment as described in lPages et all (|2012r ) while the second is based 



upon a numerical reconstruction from the structural information available in 
the PDB file. Due to the fact that the PDB file structure potentially changes 
depending upon the structure release date along with other technical complica- 
tions, pairwise alignment was used for all the analysis described in this paper 
unless specifically noted. For further information on the alignment please see 
the documentation in the GraphPAC package available on Bioconductor. Sim- 
ilar to iPAC, a successful alignment of the tertiary and mutational data was 
obtained for 140 proteins corresponding to 1100 unique structure/side-chain 
combinations. The "Structure Files" in the supplementary materials provide a 
full listing and description. 



2.4 Traveling Salesman Approach 



Since the NMC algorithm requires order statistics to identify clustering (see 
Section |2.5|) . we need to map the protein from a three dimensional to a one 
dimensional space. Since ID space is well-ordered, order statistics can then 
be construc ted. Contrary to iPAC, which employed Multidimensional Scal- 
ing (MDS) (|Borg and Groenenl . 119971) . a graph theoretic approach is used by 
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GraphPAC . This is due to the fact that in MDS, the minimization of the stress 
function results in every residue having an effect on the final position of every 
other residue. This approach may not capture that a protein is typically com- 
prised of several domains and that only residues within a specific domain should 
influence each other's final position in linear space (see Figure [TJ. 
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Figure 1: An example protein with three different domains. Under iPAC, the residues 
in Domain A will have an effect on the final position on the residues in Domain C and vice 
versa. 



Under the GraphPA C algorithm, we first construct a complete graph where 
each residue is a vertex in the graph. We then create a linear ordering of the 
protein by finding a Hamiltonian path through the graph. As the number of 
distinct Hamiltonian paths on a graph with N vertices is equal to j a di- 

rect consideration of all possible paths is computationally unfeasible. Further, 
selective pruning of the edges based upon edge distance is also often impractical 
due to the domain structure where many residues are close to each other. Due 
to these factors, we u se a heuris t ic alg orithm that solves the Traveling Sales- 
man Problem (TSP ) (fApplegatd . [2006li . Specifically, by solving the TSP (see 



Hahsler and Hornikl (|2007f )h we find a linear path through the protein that is 



approximate of the shortest path through the protein. We then use this path 
as a representative reordering of the protein into one dimensional space from 
which we can then identify clusters. Unlike iPAC, this methodology remaps the 
protein into one dimensional space but takes into account only locally neigh- 
boring residues. Amino acids on opposite ends of the protein have no effect on 

each other's position. 

While there are many heuristic solutions (see lGutin an d Punne n1(l2007h). wo 



consider 3 of the most common insertion algorithms (jRosenkrantz et aLl . ll977l ): 
cheapest insertion, farthest insertion and nearest insertion. Specifically, the 
objective of the TSP is to find a cyclic permutation it of {1, 2, 3, . . . , n} that 
minimizes the total tour distance, namely: 

n 

min y d(i,Tr(i)) 

TT ' 

1=1 

Here, d(i,j) represents the distance between residues i and j (with d(i,i) = 0) 
and 7r(«) represents the residue that follows residue i on the tour. The difference 
between the three insertion methods rests on how the next k is selected. Under 
cheapest insertion, the next k to be inserted into the tour is chosen such that the 



G 



increase in tour length is minimal. Under nearest insertion, at each iteration, 
the k that is closest to a residue already on the tour is selected. Finally, under 
farthest insertion, the k that is farthest away from any residue already on the 
tour is selected. 

These algorithms have different upper bounds on their tour lengths. For in- 
stance, the farthest insertion algorithm creates tours that approach | of the 
shortest length while the nearest and cheapest insertion algorithms can be 
linked to the minimal spanning tree algorithm and thus have an upper bound of 
twice the shortest tour leng th when distances satisfy the triangular inequality 
( Hahsler and Horrnkl . 20071 ). Due to the varied nature of these methods and 



that there is no biological justification to favor one over the other, we consider 
all three methods when identifying clusters and then perform an appropriate 
multiple comparison adjustment (see Section [276]) . 

As can be seen from Figure [5J all the rearrangement options present a pos- 
itive skew and are mostly consistent with each other. For the majority of the 
proteins, all three insertion approaches as well as the MDS approach result in 
little rearrangement. However, if one method results in radical rearrangement 
when the protein is mapped to ID space, the other methods do so as well. This 
makes selection of a specific insertion method less critical and for the rest of 
this manuscript, unless otherwise specified, we use the insertion method with 
the most significant cluster for analysis. Please see "Distribution Summary" in 
the supplementary materials for a full listing of each structure's Kendall Tau 
distance, protein index and a high resolution plot. 
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Figure 2: The amount of rearrangement performed under each of the three insertion 
methods described as well as MDS. Each column on the x-axis represents one of the 1100 
structures considered, with structures from the same protein adjacent to one another and the 
protein order determined lexicographically by protein name. The y-axis shows the Kendall 
Tau distance, which is equivalent to the number of swaps required to sort the protein back 
into {1,2,3,. . . ,..} order using bubble sort. The spikes that occur represent the DPP4, IDE, 
MET, PIK3Co, SEC23A and TF proteins, from left to right, respectively. 



2.5 NMC 

The NMC algorithm as described by I Ye et ~al. ( 2010h . and briefly reviewed here, 
was used to find the mutational clusters once the protein was remapped to ID 
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space. To begin, suppose we had m samples of a protein that was N residues 
long and that there were a total of n mutations over all m proteins. As shown in 
Figure[3J by collapsing over the m samples, we can construct order statistics for 
every mutations. We then define a cluster to exist if Prj Cki = X( k ^ — X(i )) < a, 
for some predetermined significance level a. As shown in I Ye et a t (2010), while 
a closed form calculation of the above probability is possible, it often becomes 
computationally costly. To overcome this, we calculate -M and assume that the 
statistic is uniform on (0, 1 

X, 



(fc) 



Then in limit, it can be shown that: 

-X, 



N 



<c) 



..k-i-l 



(i - y y +n - k d y 



o (k-i- l)\(i + n-k)r 
PriBetaik - i, i + n - k + 1) < c) 



(1) 



The above calculation is then performed on all pairwise mutations and then 
an appropriate multiple comparison adjustment is then applied. For the remain- 
der of this study, we use the more conservative Bonferroni correction to adjust 
for the within-protein cluster p- values. See Section l2~6l for a description of how 
we account for the intra-protein multiple comparisons. Lastly, it is important 
to mention that the structural information obtained for each protein does not 
always contain the (x,y,z) coordinates for ever residue in the protein. In such 
cases, in order to compare GraphPAC, iPACand NMC on an equal basis, these 
missing residues are removed from the protein. 
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Figure 3: An example constructing order statistics over 3 samples with 7 total mutations. 
The number inside the box indicates the residue number. A "*" above a resid ue signifies a 
non-sy nonymous missense substitution mutation for that residue. Figure from IRvslik et all 
( |2013h . 



2.6 Multiple Comparison Adjustment For Structures 

In addition to the Bonferroni multiple comparison adjustment performed to ac- 
count for multiple testing within a specific structure, we perform a second mul- 
tiple comparison adjustment to account for testing all 1100 structures. Since a 
single protein can have many structures that are similar to each other, a second 
Bonferroni adjustment is too conservative and an integrated Bonferroni-FDR 
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approach was performed. Specifically, for a given protein, the Bonferrroni ad- 
justed p- value of each cluster was multiplied by " ( " 2 ~ 1 - ) to calculate p*. Thus, p* 
could be compared directly to an a-level of 0.0 5 in order to deter mine the clus- 
ter's significance. Next, a rough FDR (rFDR) dGong et aZ.I . l2009h was obtained 
by calculating: 



rFDR = a 



k + 1 
2k 



where k = 3 x 1100 = 3300, the number of TSP insertion methods used mul- 
tiplied by the total number of structures in the study. Using a — 0.05, the 
rFDR w 0.025007. Rounding down, all the clusters for which p* < 0.025 were 
deemed to be significant. To avoid confusion in the rest of the paper, we only 
report the p- value (with the exception of Table [1]). However, each cluster dis- 
cussed in Section [3] is however significant after the Bonferroni-FDR multiple 
comparison adjustment described here. 



3 Results 

Using the GraphPAC algorithm, out of the 140 proteins analyzed, 223 (9), 225 
(10) and 226 (12) significant structures (proteins) were found under the cheap- 
est, nearest and farthest insertion methods, respectively. All three insertion 
methods of GraphPAC , as well as NMC and iPAC, all identify the 8 proteins 
shown in Table [T] to have significant clustering. However, when utilizing the ex- 
tra power gained by considering the three insertion methods described, Graph- 
PAC also identifies 5 additional proteins with significant clustering: EGFR 
(nearest insertion), DPP4 (farthest insertion), MAP2K4 (cheapest and near- 
est insertions), NRP1 (farthest insertion) and PCSK9 (farthest insertion). Of 
these 5 proteins, the only one identified by iPAC was EGFR. These 5 proteins 
correspond to a total of 6 structures, with two structures having significant clus- 
tering for EGFR. Furthermore, GraphPAC also identified the 2ENQ structure 
for the PIK3CA protein to have significant clustering, while NMC did not. It 
is important to note, that there were no proteins found to have significant clus- 
tering under the linear NMC algorithm that were subsequently missed by the 
GraphPAC algorithm. Further, while GraphPAC did not identify the EIF2AK2 
and HAOl proteins to have significant clustering (while iPAC did), GraphPAC 
did identify 4 additional proteins that were missed by iPAC, namely: DPP4, 
MAP2K4, NRP1, and PCSK9. For a full listing of which structure-protein com- 
binations were found significant, see "Results Summary" in the supplementary 
materials. 

As shown in Figure HJ each of the methods that incorporate the 3D tertiary 
structure show that for approximately 70% of all the structures found to have 
significant clustering, failure to utilize the tertiary information results in either 
an over or an underestimation of the number of clusters. Hence, failure to 
account for the protein structure may provide either an overly complicated or 
overly simplified view of the mutational orientation. 
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GraphPAC vs NMCvs iPAC Cluster Counts 




0% cheapest Nearest Farthest iPAC 
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Figure 4: A comparison of GraphPAC , iPAC and NMC over all the structures that were 
found to be significant. Each of the 3D methods are considered: all three GraphPAC insertion 
methods and iPAC. 



We note that 9 of the 13 proteins that GraphPAC identified as having signifi- 
cant clustering have their most significant cluster overlap a binding site, catalytic 
domain or kinase domain. Out of the remaining four proteins, three proteins 
have their most significant cluster fall within a previously identified biologically 
relevant region. For instance, the IDE protein's most significa nt cluster is be- 



tween residues 684-698, a denaturation- resistant epitope region (ICavender et al. 



1999). For the NRP1 pro tein, which p l ays r oles in angiogenesis (jJubb et al. 



2012 ) and axon guidance ( Maden et al. . 20121) . the most significant cluster di- 
rectly overlaps the F5 /8 type C 1 domain - a domain in many blood coagulation 
factors. Finally, for the PIK3C-a protein, the most significant cluster overlaps 
residue 1047 which has been shown t o potentially increase the substrate turnover 
rate, a common oncogenic behavior (jMankoo et all l2009l) . For a full description 
on the protein level, please see "Relevant Sites" in the supplementary materials. 
Lastly, we evaluated the performance of GraphPA C when compared t o two 



well-known machine learning algor ithms: CHASM ( Carter et al . 20091 ) and 



PolyPhen-2 (|Adzhubei et all l2010h . It is critical to note however, that the 



machine learning algorithms utilize a much more detailed set of information 
when evaluating the mutation. Thus these algorithms often identify mutations 
on residues that GraphPAC does not identify. Nevertheless, of all the mu- 
tations that fall within significant clusters identified by GraphPAC , 93% and 
91% of them were also identified as significant (FDR < 20%) by CHASM and 
PolyPhen-2 (respectively). For further details, see "Performance Comparison" 
in the supplementary materials. 
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GraphPAC 


NMC 


Protein 


P-value 


p* 


P-value 


p* 


KRAS 


4.21 E-233 


4.33 E-229 


4.39 E-233 


4.52 E-229 


TP53 


4.05 E-152 


4.48 E-147 


4.37 E-086 


5.30 E-81 


BRAF 


3.84 E-130 


1.04 E-126 


3.84 E-130 


1.04 E-126 


PIK3CA 


8.20 E-084 


3.58 E-080 


8.20 E-084 


3.58 E-080 


NRAS 


8.26 E-029 


9.91 E-027 


8.26 E-029 


9.91 E-027 


HRAS 


1.54 E-014 


6.94 E-013 


5.61 E-010 


8.42 E-009 


AKT1 


2.47 E-005 


2.47 E-004 


2.47 E-005 


7.41 E-005 


IDE 


1.56 E-003 


4.67 E-003 


1.56 E-003 


4.67 E-003 



Table 1: A comparison of 8 proteins that were found to be significant by both GraphPAC 
and NMC. The P* calculation is described in section fe.GI The smallest p-value from all of the 
insertion methods was selected. 



3.1 GraphPAC finds novel proteins 

As described in Section GraphPAC identified five additional proteins as com- 
pared to the linear NMC algorithm. In this section we will consider two of these 
proteins which are both directly cancer related: EGFR, which is also identified 
by iPAC, and NRP1, which is not identified by iPAC. 

EGF R is a cell-su rface receptor for ligands in the epidermal growth factor 
family (jHerbsti . 120041) and is present in a w ide range of diseases su ch as glioblas- 
toma multiforme ( Heimberger et al. 1 20051) . lung ad enocarcinoma ( Ladanvi and Pao , 
2008) and colorectal cancer ( Markman et ali. 2010f). T he most significant cluster 



found was in the 2ITX structure (|Yun et ali . l2007af) between residues 719-768 
(see Figure[5]) with a corresponding p-value of 0.0009. This cluster contains mu- 
tations G719S, T 751I and S768I which ar e all found in non-small cell lung card 



noma s (NSCLC) ( Yoshikawa et al. . 2012 : Simonetti et al. . 2010t Masaeo et al. 



2010h with mutation G719S well known for increased kinase activity (|Yun et al. 



2007bJ). It is also interesting to note that all three mutations within this clus- 



ter, which was identified purely through statistical cluste ring analysis, show a 



beneficial clinical response to either Erlotonib or Getfinib (|Kancha et all 12011 



Peraldo-Neia et al. . l201lUMasago et a/J . [2Qloh . Exclusion of the tertiary infor 



mation would have resulted in this cluster being missed. 

We now consider the NRP-1 protein, a coreceptor for the vascular endothelial 
growth factor ( VEGF) which is upregu lated in a large variety of can cers includ- 



ing lu ng tumors (jLantuioul et al. , 2003), gastrointestin al metasteses (jHansel et al. 



l2004h and pancreatic carcinomas (jParikh et ali . 120031 ). In NSCLC patients, it 
has been shown to be an independent predic tor of cancer relap se and reduced 



survival as well as a cancer invasion enhancer (|Hong et ali 120071 ) . Moreover, re- 



search has shown that NRP-1 inhibitors provide an additive effect to anti-VEGF 
therapy in reducing tumor progression and monoclonal antibodies that attach 
to the bl-b2 d omains, th e dom ains responsible for VEGF binding, have already 



been created (|Pan et ali . 120071 ). The bl domain, which spans residues 275-424 



almost exactly overlaps the most significant cluster found by GraphPAC , which 
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Figure 5: The EGFR ectodomain fragment structure (PDB ID 2ITX) where the 719-768 
cluster is colored in blue. The three mutations, 719,751 and 768 are displayed as purple 
spheres. 



consists of residues 277-432 (p- value 0.0158) in the 2QQI (jAppleton et all 120071 1 
structure (Figure EJ). Finally, it is worth noting that mutations on residues 297 
and 320 were recently found that completely disrupt VEGF binding, both of 
which also fall within the GraphPAC identified cluster of 277-432 in the 2QQI 
structure. 




Figure 6: The NRP-1 structure (PDB ID 2QQI) structure where the 277-432 cluster is 
colored in blue. The mutations that disrupt VEGF binding, 297 and 320 are shown in orange 
spheres while the end-points of the cluster, 277 and 432, are shown in purple spheres. 
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3.2 GraphPAC identifies additional clusters 



A representative example where GraphPA C identifies additional clusters as com- 
p ared to NMC an d iPAC is in the KRAS protein for the 3GFT structure 
Fl (|Tong et all l2009h (Figure [7J>. KRAS, a GTPase, is one of the most per- 



vasively activated oncogenes, with some estimates stating that bet ween 17-25% 
of all human tumors contain an activating mutation of the gene ([Kranenburg , 
2005f h Due to the large number of samples with mutations in this gene and the 



resulting strong statistical signal, GraphPAC , iPAC and NMC all identify that 
KRAS contains highly statistically significant mutational clusters. Nevertheless, 
GraphPAC identifies several novel clusters that are missed by iPAC and NMC. 
While all three methods identify clustering at residues 12-13, 12-61 and 12-146, 
only iPAC and GraphPAC identify a additional clusters at 1) 61-117 and 2) 
117-146. Moreover, only GraphPAC (under the cheapest and nearest insertion 
methods) identifies a statistically significant cluster for residues 12-23 and 23-61 
as shown in Tabled 











Residues 


NMC 


iPAC 


GraphPAC 


12-13 


9.45 E-229 


3.91 E-165 


8.95 E-229 


12-23 






1.31 E-99 


12-61 


4.34 E-65 


2.38E E-87 


5.49 E-164 


12-146 


3.85 E-13 


3.81 E-90 


2.87 E-16 


23-61 






1.01 E-105 


61-146 




3.01 E-106 


4.35 E-31 


117-146 




1.66 E-102 





Table 2: P- value comparison of the three algorithms for several significant clusters. A 
"-" signifies that the method did not find that cluster to be significant. For GraphPAC , the 
cheapest insertion results are reported here. 



As seen from the the table, GraphPAC uniquely highlights 2 specific areas 
of the protein: 1) 12-22 and 2) 22-61. Considering the 12-22 cluster, we see that 
a sub-cluster of 12-13 is identified as well. This follows biological function as 
mutations on residues 12 and 13 appear in a large variety of cancers, such as 



breast, lung, bladder, pan creas and colon (jMcCov et all 119841 : iMotoiima et al. 



1 19931: Isibi om et all . l2006h while mutations on residues 22 and 23 came from 
colorectal/large intestine tissue samples in our data. It is interesting to note that 
gcrmline mutations on residue 22 often result in developmental disorders such 
as Noo nan Syndrome Type 3 (NS3) as we l l as C ardiofaciocutaneous Syndrome 



(CFC) (jZenker et all l2007t iGremer et all 120111) 



1 For this analysis, a manual reconstruction was performed in order to include residue 61 
which is listed as a histidine under isoform 2B in the Uniprot Database and a glutamine in 
the COSMIC database. As the substitution of one amino acid in structure would not have a 
significant impact on the spatial structure of the protein, and residue 61 is a highly mutable 
position, the residue was kept in the analysis. As a result, amino acids 1 - 167 are used. 
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Finally, the majority of mutations in cluster 61-146 also fall along biological 
lines with all the mutations in our data either occurring in lung or gastroin- 
testinal tract/large intestine carcinomas. Specifically, resi due 61 is high l y mu- 
table with mutations found in colorectal and lung cancer (jSiblom et all 120061: 
Tarn et al. , 2006) while mutations K1 17N and A146T are found specifically in 



colorectal cancer (jSiblom et all 120061) 




Figure 7: The KRAS structure (PDB ID 3GFT) color coded by region: amino acids 13-22 
are blue, 24-60 are red and 62-145 are yellow. Residues 12 and 13 which make up the most 
significant cluster are shown as purple spheres, while residues 23, 61, 117 and 146 are shown 
as brown spheres. 



3.3 GraphPAC finds fewer clusters compared to NMC 

As seen from FigureJH between 30%-40% of the structures identified with signif- 
icant clustering had fewer clusters under the GraphPAC methodology as com- 
pared to the linear NMC algorithm with the vast majority of these structures 
corresponding to BRAF, HR AS and TP53. H ere we consider a representative 



example, the 4E26 structure (|Qin et all 120121 ) for BRAF when analyzed using 



the farthest insertion method (Figure H]). As iPAC identified even more clusters 
than NMC, we compare GraphPAC to NMC in this section when showing that 
fewer mutational clusters is of benefit. Further, as V600 is well known to be the 
most likely mutated position in BRAF, the most significant cluster identified by 
GraphPAC ', iPAC and NMC is located only on that residue with a p- value of 
2.12 x 10~ 129 under all three methods. In all, GraphPAC identifies 16 clusters 
while NMC identifies 22, with the differences shown in Table G3 

Although it is outside the scope of this manuscript to consider every differ- 
ence between Tables [3a] and I3bl we observe that three of the longest clusters 
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P-value 


Start 


End 


J£ Mnts 


GraphPAC 


NMC 


600 

UUU 


600 

UUU 


60 

UU 


2.12E-129 


2.12E-129 


597 


600 

UUU 


69 


1 49E-104 


1 49E-104 


uuu 


UU 1 


u^ 


1 49E-104 


9 22E-117 


Oc/U 


uuu 




7 16E-102 


7 16E-102 


OcJU 


UU1 


UU 


3.37E-91 


1.16E-100 


597 


601 


64 


8.07E-91 


7.16E-102 


601 


671 


3 


5.85E-38 




600 


671 


63 


8.30E-37 


7.08E-26 


469 


601 


72 


2.59E-22 


5.92E-17 


581 


601 


68 


1.23E-21 


1.33E-65 


581 


600 


66 


2.94E-20 


3.13E-63 


469 


600 


70 


3.98E-20 


4.91E-15 


466 


601 


74 


2.15E-17 


9.69E-19 


466 


600 


72 


7.01E-16 


1.60E-16 


464 


601 


75 


1.15E-15 


1.12E-19 


464 


600 


73 


2.33E-14 


2.97E-17 



(a) Clusters found by GraphPAC. A "-" for the NMC value sig- 
nifies that cluster was not identified under the linear algorithm. 



Start 


End 


# Muts. 


NMC Pvalue 


596 


671 


67 


4.12E-29 


597 


671 


65 


4.79E-27 


581 


671 


69 


3.33E-26 


464 


671 


76 


5.92E-09 


466 


671 


75 


3.32E-08 


469 


671 


73 


8.11E-07 



(b) Clusters found by NMC and dropped by Graph- 
PAC. 



Table 3: Table l3al shows the significant clusters that were identified by both GraphPAC 
and NMC. Table l3bl shows the significant clusters that were not found to be significant under 
GraphPAC but were found to be significant under NMC. 

464-671, 466-671 and 469-671 are dropped by GraphPAC . Since after alignment 
of the protein structural data to the mutational data fsee !2.2l) . tertiary informa- 
tion was available on residues 448-603 and 610-723, these clusters cover 74.8%, 
74.1% and 73.0% of the all the available residues, respectively. By consider- 
ing the 3D structure via GraphPAC , the longest clusters are dropped and the 
remaining overlapping clusters focus almost exclusively on residues 464-600. 

After structure and mutation alignment, the residue substitutions in signifi- 
cant clusters include: G464V, G466V, G469V, G469A, N581S, G596R, L597V, 
LV597R, V600E, V600K, K601N and R671Q. Since R671Q does not have ex- 
tensive literature and comes from a nons-specified tissue sample in the COS- 
MIC database, it will no longer be considered here. Thus, by considering the 
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Figure 8: The BRAF structure (PDB ID 4E26) color coded by segment: I) Amino 464-599 
are blue 2) Amino Acids 601-671 are green. The oj-carbons of the mutated residues 464, 466, 
469, 581 ,596, 597 , 601 and 671 are shown as purple spheres. Residue 600 is shown as a red 
sphere. 



tertiary structure, we significantly narrow the window of which residues to con- 
sider for potential driver mutations and can partition the protein into three 
segments: I) 464-599 and II) 600 and III) 601. Moreover, se gment I is pri- 



marily associated with lung and colorectal cancer as shown in (jGandhi et al.. . 
20091: iNaoki et all 120021 iGreenman et all 120071 : iDavies et all |2002j). Segment 



II represents the two most common mutations in BRAF, V600E and V600K. 
Overall, 95% of BRAF mutations occur on V600, with some studies show- 
ing that V600E occurs within 73% to 79% of patients while V 600K occurs 
within 12% to 19% of patients (jLovlv et all l2012t iMenzies et~ai\ , l2012h . Mu- 
tations at this position result in the oncogene being constitutively activated 
with increased kinase activity and has been found in a wide range of cancers 



such as metastatic me lanoma (jSosman et all , 12012 ), ovarian serous carcinoma 
( Grisham et al , 2013 ) and hairy cell leukemia ( Ewalt et al , 120121 ). Further- 



more, recent inhibitors, such as Vemurafenib and GSK2118436 specifically tar- 
get the V600E and V600E/K mutations (respectively), supporti ng the hypoth- 



esis t hat somatic clusters can provide pharmacological targets (jLemech et al. 



20111 ). Lastly, segment III is comprised of the much less common K601N muta- 



tion which has been observed in myeloma cases along with V600E. Since these 
patients share the more common BRA F mutations as well, th ey may also po- 



tentially benefit from BRAF inhibitors (jChapman et al. , 2011) 



4 Conclusion 

In this manuscript we provide an alternative method to utilize protein tertiary 
structure when identifying somatic mutation clusters. By employing a graph 
theoretic approach to restructuring the protein order, we identify both new 
clusters in proteins previously shown to have clustering as well as proteins that 
were not previously shown to have clustering. We have also provided several 
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examples where we are able to identify clusters of mutations that may benefit 
from pharmacological treatment. Moreover, as GraphPAC uses the NMC algo- 
rithm to identify clusters rather than a fixed window size, we are able to detect 
clusters of varying lengths. Finally, the methodology is fast and robust with 
the overwhelming majority of structure/protein combinations taking under 10 
minutes each to analyze on a consumer desktop with an Intel i7-3600k processor 
running at 3.40 GHZ and 16GB of DDR3 RAM. 

The GraphPAC algorithm, while presenting a viable alternative to the MDS 
restriction of IP AC and an improvement over NMC, nevertheless contains sev- 
eral limitations. First, while no longer bound to the MDS requirement of iPAC, 
there is no closed form solution to the shortest path problem and our algorithm 
must appeal to heuristic approximations. Further, while Figure [2] shows that for 
most structures all three insertion methods are quite similar in their rearrange- 
ment of the protein, for several structures the variability between methods can 
still be high. As such, future research is necessary to remove the one-dimension 
requirement and consider the protein in its native 3D state. 

Second, to satisfy the uniformity assumption the mutation status of all 
residues must be known ahead of time. With the growth of high-throughput 
sequencing however, this issue is temporary. Next, unequal rates of mutage- 
nesis along with hypermutability of specific genomic regions may violate the 
assumption that every residue has a uniform probability of mutation. To help 
ensure that this assumption holds, we only consider single residue missense sub- 
stitutions and have removed insertions and deletions from the analysis since 
they tend to be sequence dependent. Further, research has shown that CpG 
dinucleotides may have a mutational f r equen cy ten times or higher compared 
to other dinucleotides ( Sved and BirdL 1990h . However, in the analyses pre- 
sented in Sections 13.11 13.21 13.31 only approximately 13% of the mutations used 
to identify clustering occurred in CpG sites. Relatedly, colorectal carcinomas 
( Holl stein et al. I. ll99lh contain more transition mutations while cigarette use re- 
sults in more transversion mutations in lung carcinomas ( Ye et al. , 2010). Still, 



when considering KRAS, the overwhelming majority of substitutions occur on 
residues 12,13, and 61 for both colorectal and lung cancer, implying that while 
the mutational landscape may vary, it does not have a significant effect on mu- 
tation location and thus would not violate the uniformity assumption. Hence, 
while this analysis is influenced by a variety of factors, as are previous studies, 
it nevertheless appears that the primary cause of clustering is selection for a 
cancer phenotype. 

Further, as described in Section [31 GraphPAC finds fewer clusters for a sig- 
nificant percentage of the structures analyzed. Overall, the reduction in total 
clusters identified can result from two sources: the removal of some residues be- 
cause no tertiary data was available or the cluster is no longer significant when 
using the traveling salesman algorithm to account for 3D structure. The first 
case, which is already rare, will become increasingly more so as additional stud- 
ies result in more complete and detailed structural information. For the second 
case, if a cluster is not found to be significant under GraphPAC when compared 
to NMC, a near or overlapping cluster is usually found (see Tables l3al and [3b]) . 
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For BRAF specifically, under every type of graph insertion method (cheapest, 
nearest and farthest), every "probably damaging" or "possibly damaging" mu- 
tation (as classified by PolyPhen-2 was still identified in at least one significant 
cluster for the structure. For a complete analysis, see "Potential Driver Loss" 
in the supplementary materials. 

In summary, GraphPAC represents a novel technique to utilize protein ter- 
tiary structure via a graph theoretic approach in identifying mutational clus- 
tering. We show that this method identifies new clusters that are otherwise 
missed and that in some cases, pharmaceutical targets for these clusters have 
already been found and therapies created. This helps confirm the hypothesis 
that mutational clustering may be indicative of driver mutations and as new 
protein structures become available, GraphPAC provides a rapid methodology 
to identify such potential mutations. 
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